Data Exploration and Visualisation

UrbanSound dataset

For this project we will use a dataset called Urbansound8K. The dataset contains 8732 sound excerpts (<=4s) of urban sounds from 10 classes, which are:

  • Air Conditioner

  • Car Horn

  • Children Playing

  • Dog bark

  • Drilling

  • Engine Idling

  • Gun Shot

  • Jackhammer

  • Siren

  • Street Music

The accompanying metadata contains a unique ID for each sound excerpt along with it’s given class name.

A sample of this dataset is included with the accompanying git repo and the full dataset can be downloaded from here.

Audio sample file data overview

These sound excerpts are digital audio files in .wav format.

Sound waves are digitised by sampling them at discrete intervals known as the sampling rate (typically 44.1kHz for CD quality audio meaning samples are taken 44,100 times per second).

Each sample is the amplitude of the wave at a particular time interval, where the bit depth determines how detailed the sample will be also known as the dynamic range of the signal (typically 16bit which means a sample can range from 65,536 amplitude values).

This can be represented with the following image:

Therefore, the data we will be analysing for each sound excerpts is essentially a one dimensional array or vector of amplitude values.

Analysing audio data

For audio analysis, we will be using the following libraries:

1. IPython.display.Audio

This allows us to play audio directly in the Jupyter Notebook.

2. Librosa

librosa is a Python package for music and audio processing by Brian McFee and will allow us to load audio in our notebook as a numpy array for analysis and manipulation.

You may need to install librosa using pip as follows:

pip install librosa

Auditory inspection

We will use IPython.display.Audio to play the audio files so we can inspect aurally.

import IPython.display as ipd

ipd.Audio('../UrbanSound Dataset sample/audio/100032-3-0-0.wav')

Visual inspection

We will load a sample from each class and visually inspect the data for any patterns. We will use librosa to load the audio file into an array then librosa.display and matplotlib to display the waveform.

# Load imports

import IPython.display as ipd
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_8_1.png
# Class: Car horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_9_1.png
# Class: Children playing 

filename = '../UrbanSound Dataset sample/audio/100263-2-0-117.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_10_1.png
# Class: Dog bark

filename = '../UrbanSound Dataset sample/audio/100032-3-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_11_1.png
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_12_1.png
# Class: Engine Idling 

filename = '../UrbanSound Dataset sample/audio/102857-5-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_13_1.png
# Class: Gunshot

filename = '../UrbanSound Dataset sample/audio/102305-6-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_14_1.png
# Class: Jackhammer

filename = '../UrbanSound Dataset sample/audio/103074-7-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_15_1.png
# Class: Siren

filename = '../UrbanSound Dataset sample/audio/102853-8-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_16_1.png
# Class: Street music

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
plt.figure(figsize=(12,4))
data,sample_rate = librosa.load(filename)
_ = librosa.display.waveplot(data,sr=sample_rate)
ipd.Audio(filename)
../_images/1 Data Exploration and Visualisation_17_1.png

Observations

From a visual inspection we can see that it is tricky to visualise the difference between some of the classes.

Particularly, the waveforms for reptitive sounds for air conditioner, drilling, engine idling and jackhammer are similar in shape.

Likewise the peak in the dog barking sample is simmilar in shape to the gun shot sample (albeit the samples differ in that there are two peaks for two gunshots compared to the one peak for one dog bark). Also, the car horn is similar too.

There are also similarities between the children playing and street music.

The human ear can naturally detect the difference between the harmonics, it will be interesting to see how well a deep learning model will be able to extract the necessary features to distinguish between these classes.

However, it is easy to differentiate from the waveform shape, the difference between certain classes such as dog barking and jackhammer.

Dataset Metadata

Here we will load the UrbanSound metadata .csv file into a Panda dataframe.

import pandas as pd
metadata = pd.read_csv('../UrbanSound Dataset sample/metadata/UrbanSound8K.csv')
metadata.head()
slice_file_name fsID start end salience fold classID class_name
0 100032-3-0-0.wav 100032 0.0 0.317551 1 5 3 dog_bark
1 100263-2-0-117.wav 100263 58.5 62.500000 1 5 2 children_playing
2 100263-2-0-121.wav 100263 60.5 64.500000 1 5 2 children_playing
3 100263-2-0-126.wav 100263 63.0 67.000000 1 5 2 children_playing
4 100263-2-0-137.wav 100263 68.5 72.500000 1 5 2 children_playing

Class distributions

print(metadata.class_name.value_counts())
children_playing    1000
dog_bark            1000
street_music        1000
jackhammer          1000
engine_idling       1000
air_conditioner     1000
drilling            1000
siren                929
car_horn             429
gun_shot             374
Name: class_name, dtype: int64

Observations

Here we can see the Class labels are unbalanced. Although 7 out of the 10 classes all have exactly 1000 samples, and siren is not far off with 929, the remaining two (car_horn, gun_shot) have significantly less samples at 43% and 37% respectively.

This will be a concern and something we may need to address later on.

Audio sample file properties

Next we will iterate through each of the audio sample files and extract, number of audio channels, sample rate and bit-depth.

# Load various imports 
import pandas as pd
import os
import librosa
import librosa.display

from helpers.wavfilehelper import WavFileHelper
wavfilehelper = WavFileHelper()

audiodata = []
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath('/Volumes/Untitled/ML_Data/Urban Sound/UrbanSound8K/audio/'),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    data = wavfilehelper.read_file_properties(file_name)
    audiodata.append(data)

# Convert into a Panda dataframe
audiodf = pd.DataFrame(audiodata, columns=['num_channels','sample_rate','bit_depth'])

Audio channels

Most of the samples have two audio channels (meaning stereo) with a few with just the one channel (mono).

The easiest option here to make them uniform will be to merge the two channels in the stero samples into one by averaging the values of the two channels.

# num of channels 

print(audiodf.num_channels.value_counts(normalize=True))
2    0.915369
1    0.084631
Name: num_channels, dtype: float64

Sample rate

There is a wide range of Sample rates that have been used across all the samples which is a concern (ranging from 96k to 8k).

This likley means that we will have to apply a sample-rate conversion technique (either up-conversion or down-conversion) so we can see an agnostic representation of their waveform which will allow us to do a fair comparison.

# sample rates 

print(audiodf.sample_rate.value_counts(normalize=True))
44100     0.614979
48000     0.286532
96000     0.069858
24000     0.009391
16000     0.005153
22050     0.005039
11025     0.004466
192000    0.001947
8000      0.001374
11024     0.000802
32000     0.000458
Name: sample_rate, dtype: float64

Bit-depth

There is also a wide range of bit-depths. It’s likely that we may need to normalise them by taking the maximum and minimum amplitude values for a given bit-depth.

# bit depth

print(audiodf.bit_depth.value_counts(normalize=True))
16    0.659414
24    0.315277
32    0.019354
8     0.004924
4     0.001031
Name: bit_depth, dtype: float64

Other audio properties to consider

We may also need to consider normalising the volume levels (wave amplitude value) if this is seen to vary greatly, by either looking at the peak volume or the RMS volume.

In the next notebook we will preprocess the data